feat(orchestrator): add penalize action for gibberish/repetition filters by anravich13-cloud · Pull Request #2775 · PrimeIntellect-ai/prime-rl

anravich13-cloud · 2026-06-11T19:11:26Z

Adds an opt-in `penalize` action for rollout filters. Gibberish and repetition filters can now cap detected rollout rewards to a configurable negative value (`penalty_reward`, default -1.0) before advantage computation, creating negative training signal without dropping the rollout.

- Filter configs gain `action` (monitor|drop|penalize) + `penalty_reward`; legacy `enforce` configs still parse (true→drop, false→monitor), conflicting combinations raise a validation error.
- Filters gain a phase: gibberish/repetition are pre_advantage, zero_advantage is post_advantage. TrainSink.process_group now applies pre-advantage filters before assign_advantages so penalized rewards are visible to the group baseline (and to sample.reward propagation).
- Penalties preserve the original reward (rollout.raw_reward) and record per-filter metadata (rollout.reward_penalties); both flow into saved rollouts via to_dict.
- New metrics: filters/all/{name}_penalized and raw_reward/all/{mean,max,min} when penalties fire.
- Defaults unchanged: gibberish/repetition monitor, zero_advantage drop.
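The cap-and-record behaviour described in the list above can be sketched roughly as follows. This is an illustration only: `RolloutSketch` and `penalize` are hypothetical names, and the shape of each `reward_penalties` entry is an assumption, not the structure the PR actually stores.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class RolloutSketch:
    """Hypothetical stand-in for a rollout; not the prime-rl rollout class."""
    reward: float
    raw_reward: Optional[float] = None  # original reward, set on the first penalty
    reward_penalties: list = field(default_factory=list)


def penalize(rollout: RolloutSketch, filter_name: str, penalty_reward: float = -1.0) -> None:
    """Cap the reward at penalty_reward and keep the original for inspection."""
    if rollout.raw_reward is None:
        rollout.raw_reward = rollout.reward
    capped = min(rollout.reward, penalty_reward)
    # Assumed metadata layout: one dict per filter that fired.
    rollout.reward_penalties.append(
        {"filter": filter_name, "before": rollout.reward, "after": capped}
    )
    rollout.reward = capped


flagged_rollout = RolloutSketch(reward=0.8)
penalize(flagged_rollout, "gibberish")

already_low = RolloutSketch(reward=-2.0)
penalize(already_low, "repetition")
```

Because the cap is a `min`, a rollout whose reward is already below `penalty_reward` keeps its lower reward; the penalty never raises a reward.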

Note

Medium Risk
Changes GRPO training signal timing (reward caps before advantages) and filter config semantics; defaults are preserved but misconfigured penalize could skew advantages.

Overview
Rollout filters now use action (monitor | drop | penalize) instead of a boolean enforce, via shared BaseFilterConfig with resolved_action, penalty_reward, and validation when action and enforce disagree. penalize caps reward with min(raw, penalty_reward), keeps rollouts trainable, and records raw_reward / reward_penalties on TrainRollout (surfaced in saved rollouts).
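A minimal sketch of how `action` and the legacy `enforce` flag could resolve to a single action. `FilterConfigSketch` and its `default_action` field are hypothetical, and treating every mismatch between `action` and `enforce` as a conflict is an assumption; the PR's `BaseFilterConfig` may define conflicts more narrowly.

```python
from dataclasses import dataclass
from typing import Optional

VALID_ACTIONS = ("monitor", "drop", "penalize")


@dataclass
class FilterConfigSketch:
    """Hypothetical config; not the PR's BaseFilterConfig."""
    action: Optional[str] = None     # new-style field
    enforce: Optional[bool] = None   # legacy field
    penalty_reward: float = -1.0
    default_action: str = "monitor"  # assumed per-filter default

    def __post_init__(self) -> None:
        if self.action is not None and self.action not in VALID_ACTIONS:
            raise ValueError(f"unknown action: {self.action!r}")
        legacy = self._legacy_action()
        # Assumption: any disagreement between the two fields is an error.
        if self.action is not None and legacy is not None and self.action != legacy:
            raise ValueError(
                f"action={self.action!r} conflicts with enforce={self.enforce}"
            )

    def _legacy_action(self) -> Optional[str]:
        if self.enforce is None:
            return None
        return "drop" if self.enforce else "monitor"

    @property
    def resolved_action(self) -> str:
        if self.action is not None:
            return self.action
        legacy = self._legacy_action()
        return legacy if legacy is not None else self.default_action


legacy_true = FilterConfigSketch(enforce=True).resolved_action    # "drop"
legacy_false = FilterConfigSketch(enforce=False).resolved_action  # "monitor"
new_style = FilterConfigSketch(action="penalize").resolved_action # "penalize"
```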

TrainSink.process_group runs pre-advantage filters (gibberish/repetition) before assign_advantages, then post-advantage filters (zero advantage); pre-batch drop stats count only drop actions. process_batch re-syncs sample.reward after post-batch penalize. Metrics add penalized rates and raw_reward/all/* when penalties occur.
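Why the ordering matters can be seen with a toy group. The sketch below assumes a mean-baseline advantage (reward minus group mean); the real `assign_advantages` may use a different rule, and both helper names are hypothetical.

```python
def mean_baseline_advantages(rewards):
    """Assumed advantage rule: reward minus the group mean."""
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]


def cap_flagged(rewards, flags, penalty_reward=-1.0):
    """Cap the reward of every flagged rollout at penalty_reward."""
    return [min(r, penalty_reward) if f else r for r, f in zip(rewards, flags)]


group_rewards = [1.0, 1.0, 1.0, 1.0]
group_flags = [False, False, False, True]  # last rollout flagged as gibberish

# Pre-advantage penalize: the cap shifts the baseline, so the flagged rollout
# gets a negative advantage and the clean rollouts a positive one.
penalize_first = mean_baseline_advantages(cap_flagged(group_rewards, group_flags))
# → [0.5, 0.5, 0.5, -1.5]

# Advantages computed on uncapped rewards: the group is uniform, so no
# rollout carries any signal, flagged or not.
advantages_first = mean_baseline_advantages(group_rewards)
# → [0.0, 0.0, 0.0, 0.0]
```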

Defaults stay monitor for gibberish/repetition and drop for zero advantage; legacy enforce still maps to monitor/drop.

Reviewed by Cursor Bugbot for commit 85b696e. Bugbot is set up for automated code reviews on this repo.

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 5208ce8.

A penalize filter in post_batch_filters caps rollout.raw['reward'] in process_batch, after process_group already propagated reward onto the trainer-bound TrainingSamples. Re-stamp sample.reward for penalized rollouts so shipped samples agree with the rollout reward used in metrics. Advantage is intentionally untouched: post-batch runs after advantage computation, so a penalty there is metadata-only.
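The re-stamp step described in this commit message could look roughly like the sketch below. `SampleSketch`, `BatchRollout`, the `penalized` flag, and `resync_sample_rewards` are hypothetical names chosen for illustration; the actual sink code tracks penalties through its own rollout fields.

```python
from dataclasses import dataclass, field


@dataclass
class SampleSketch:
    """Hypothetical trainer-bound sample."""
    reward: float
    advantage: float


@dataclass
class BatchRollout:
    """Hypothetical rollout holding the samples built from it."""
    reward: float
    penalized: bool = False
    samples: list = field(default_factory=list)


def resync_sample_rewards(rollouts) -> int:
    """Copy the post-penalty reward onto each penalized rollout's samples.

    Advantage is left untouched: it was computed before the post-batch
    filter ran. Returns the number of samples re-stamped.
    """
    restamped = 0
    for rollout in rollouts:
        if not rollout.penalized:
            continue
        for sample in rollout.samples:
            sample.reward = rollout.reward
            restamped += 1
    return restamped


# Samples were stamped with reward 0.7 earlier, in the per-group step.
stale = BatchRollout(reward=0.7, samples=[SampleSketch(0.7, 0.2), SampleSketch(0.7, 0.2)])
clean = BatchRollout(reward=0.3, samples=[SampleSketch(0.3, -0.2)])

# A post-batch penalize filter then caps the rollout reward.
stale.reward = min(stale.reward, -1.0)
stale.penalized = True

restamped_count = resync_sample_rewards([stale, clean])
```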

Covers the sink paths the filter unit tests can't reach:

- process_group applies pre-advantage penalize before the group baseline and stamps post-penalty reward/advantage onto trainer-bound samples;
- an equally-penalized group collapses to zero advantage and is dropped by the post-advantage zero_advantage filter (drop attribution excludes the penalty filter);
- process_batch re-syncs sample.reward after a post-batch penalize (regression test for stale TrainingSample rewards);
- post-batch drop still excludes samples from the shipped batch.

Bypasses tokenizer/renderer by pre-building rollout.samples and driving process_group / process_batch directly.
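The equally-penalized case can be illustrated with a few lines. This sketch assumes a mean-baseline advantage and a zero-advantage check with a small tolerance; `is_zero_advantage_group` and the attribution string are hypothetical and only mirror the behaviour the test list describes.

```python
def mean_baseline(rewards):
    """Assumed advantage rule: reward minus the group mean."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def is_zero_advantage_group(advantages, eps=1e-8):
    """True when no rollout in the group carries any training signal."""
    return all(abs(a) <= eps for a in advantages)


raw_group = [0.9, 0.4, 0.7]
# Every rollout is flagged, so every reward is capped to the same value.
capped_group = [min(r, -1.0) for r in raw_group]

capped_advantages = mean_baseline(capped_group)
group_dropped = is_zero_advantage_group(capped_advantages)

# The drop is attributed to the zero-advantage filter, not the penalty
# filter, which only changed rewards and dropped nothing itself.
dropped_by = "zero_advantage" if group_dropped else None
```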

anravich13-cloud added 2 commits June 9, 2026 23:05

style: ruff format metrics.py

695cdf1

cursor Bot reviewed Jun 11, 2026

Comment thread src/prime_rl/orchestrator/train_sink.py

anravich13-cloud added 2 commits June 11, 2026 20:03

anravich13-cloud wants to merge 4 commits into main from feat/repetition-gibberish-penalty
